Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

Authors

  • Hercules Dalianis
  • Martin Rimka
  • Viggo Kann
Abstract

This paper presents how we adapted a website search engine for cross language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains text in six different languages. To compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and a trilingual dictionary constructed automatically from the website's news corpus using Uplug. The precision and recall of the Swedish-English dictionary constructed automatically with Uplug were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covers only half of the top 100 search queries at the website. The automatically built trilingual dictionary combined with the small manually built trilingual dictionary consists of about 2000 words and covers 27 of the top 100 search queries.
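The two kinds of figures reported in the abstract, query coverage and precision/recall, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `coverage` and `precision_recall`, the toy word lists, and the evaluation against a hand-made gold standard of translation pairs. The abstract does not say how the authors computed their numbers, so this shows only the standard definitions.

```python
def coverage(top_queries, dictionary):
    """Fraction of queries that have an entry in the dictionary (hypothetical helper)."""
    if not top_queries:
        return 0.0
    covered = [q for q in top_queries if q.lower() in dictionary]
    return len(covered) / len(top_queries)


def precision_recall(extracted_pairs, gold_pairs):
    """Standard precision and recall of extracted translation pairs against a gold set."""
    extracted, gold = set(extracted_pairs), set(gold_pairs)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for this example only (not taken from the paper).
toy_dictionary = {"samarbete": "cooperation", "miljö": "environment"}
toy_queries = ["samarbete", "miljö", "stipendier", "nordjobb"]
query_coverage = coverage(toy_queries, toy_dictionary)

toy_extracted = [("samarbete", "cooperation"), ("miljö", "environment"), ("råd", "advice")]
toy_gold = [("samarbete", "cooperation"), ("miljö", "environment"),
            ("råd", "council"), ("språk", "language")]
toy_precision, toy_recall = precision_recall(toy_extracted, toy_gold)

print(f"coverage={query_coverage:.2f} precision={toy_precision:.2f} recall={toy_recall:.2f}")
```

In this toy setup, a dictionary that covered 27 of 100 queries would give `coverage(...) == 0.27`, which is the form of the figure reported for the combined trilingual dictionary.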

Similar resources

Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian

This paper presents how we adapted a website search engine for cross language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by the visitors of the website of the Nordic Council containing five different languages. In order to compare how well different types of bilingual dictionaries covered the most common ...

Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages

Hallå Norden is a web site with information regarding mobility between the Nordic countries in five different languages: Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus. From this we extr...

To search and summarize in Scandinavia

Automatic text summarization is the method where a computer summarizes a text. A text is given to the computer and it returns a non-redundant shorter text. Text summarization can be used to summarize news in the Business Intelligence domain, automatically edit news in the news paper setting domain and summarize news down to a length suitable for SMS and WAP but also to summarize news before the...

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...

Automatic Dictionary Construction and Identification of Parallel Text Pairs

When creating dictionaries for use in, for example, cross-language search engines, parallel or comparable text pairs are needed. Multilingual web sites may contain parallel texts but these can be difficult to detect. For instance, a multilingual website, Hallå Norden, contains information in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. Working with these texts we discovered ...

Publication date: 2009